University of Strathclyde at Headline Ranking TREC BLOG 2010

Author

  • Dmitri Roussinov
Abstract

The University of Strathclyde participated only in the TREC BLOG Headline Ranking task. Our general theme was to explore how lexical changes in the blog corpus can reflect the importance of articles appearing in the news corpus. Three runs were submitted. For the automated run “strath1”, our algorithm identified the word unigrams whose frequency of mention in the blog corpus increased substantially on the day of the query. Up to 100 such words were used as a query to retrieve and rank the headlines using the Terrier platform and its PL2 model. The automated run “strath3” was similar to “strath1”, except that weights, estimated from the size of the increase in frequency of use, were applied to the query words. The third run, “strath2”, was manual. Event descriptions were taken from the “Current Event” articles of the Wikipedia Portal for the day of each topic (e.g., 22 January 2008 for the first topic). The description of each event was sent to the Bing search engine. The words that occurred more frequently within the returned snippets than on the Web on average were used to query the headlines corpus. Our participation was in close collaboration with the University of Glasgow group, which provided (1) the index of the news corpus, (2) the daily statistics on the lexicons of the blog corpus, and (3) the classification of the headlines into the required set of categories.

HEADLINE RANKING TASK

The top stories identification task (headline ranking) was first run as a pilot task in TREC 2009 to address the news dimension of the blogosphere (Macdonald et al., 2009). The task explored whether the blogosphere can be used to identify the most important news stories for a given day. In response to a date “query”, systems were expected to provide a ranking of 100 headlines (stories) that they considered important on the specified day. For this year, Thomson-Reuters provided the TRC2 newswire corpus, covering the same time span as the corpus used in 2009.
Not only was the newer TRC2 corpus much larger, it also included the full content of each story. NIST distributed the corpus to TREC participants free of charge. Unlike in 2009, the task was treated as online event detection: an automatic run was allowed to use only data created prior to the query date (e.g., an ontology or a Wikipedia article).

Two main approaches were used during TREC 2009 to identify top stories (Macdonald et al., 2009; Balog et al., 2006): (i) News to Blogs, where each mention of a headline in the blogs was typically counted as a vote for its importance, and (ii) Blogs to News, which generally proceeds by the following steps: 1. observe the blog posts from the given date; 2. detect what differentiates these posts from the previous posts; 3. identify the emerging topics; 4. rank the headlines by their similarity to the emerging topics. The overall observation was that “Blogs to News” approaches worked better. This motivated our involvement and our specific choices of techniques to explore, as we elaborate in the next section. The “Discussion Of Results” section is followed by “Related Work” and then by “Conclusions, Limitations And Future Directions.”
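
The four “Blogs to News” steps, combined with the weighted-query idea described for “strath3”, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the add-one smoothing, the `min_ratio` threshold, and the log-ratio weight are all assumptions made for illustration, and a plain weighted term overlap stands in for the Terrier PL2 ranking used in the actual runs.

```python
import math
from collections import Counter


def bursty_terms(today_counts, background_counts, background_days,
                 max_terms=100, min_ratio=2.0):
    """Steps 1-3: pick unigrams whose frequency rose on the query day.

    Returns up to max_terms (term, weight) pairs, strongest increase first.
    The weight is the log of the increase ratio (an assumed choice).
    """
    scored = []
    for term, count in today_counts.items():
        # Expected daily count from earlier days, with add-one smoothing
        # so that unseen terms do not cause a division by zero.
        expected = (background_counts.get(term, 0) + 1) / background_days
        ratio = count / expected
        if ratio >= min_ratio:
            scored.append((term, math.log(ratio)))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_terms]


def rank_headlines(headlines, weighted_terms, top_k=100):
    """Step 4: rank headlines by weighted overlap with the bursty terms.

    A simple stand-in for a real retrieval model such as PL2.
    """
    weights = dict(weighted_terms)
    scored = []
    for headline in headlines:
        tokens = headline.lower().split()
        score = sum(weights.get(token, 0.0) for token in tokens)
        if score > 0:
            scored.append((score, headline))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [headline for _, headline in scored[:top_k]]


# Toy data (hypothetical): blog word counts on the query day versus
# the totals over the 10 preceding days.
today = Counter({"earthquake": 40, "the": 1000, "rescue": 12})
background = Counter({"earthquake": 9, "the": 9900, "rescue": 19})
headlines = [
    "Earthquake rescue teams arrive",
    "The markets close higher",
    "Rescue funding debated",
]

query = bursty_terms(today, background, background_days=10)
print(rank_headlines(headlines, query))
```

Setting every weight to 1.0 instead of the log ratio would correspond more closely to the unweighted query described for “strath1”.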

Related articles

From Blogs to News: Identifying Hot Topics in the Blogosphere

We describe the participation of the University of Amsterdam’s ILPS group in the blog track at TREC 2009. We focus on the top stories identification task, and take an approach that does not require the headlines of top stories to be known beforehand. We explore the feasibility of a so-called blogs to news approach: given a date and a set of blog posts, identify the main topics for that date. Th...

PRIS at TREC 2010 Blog Track: Faceted Blog Distillation

This paper presents the system adopted for the Faceted Blog Distillation task by PRIS team. The PRIS system is submitted by Pattern Recognition and Intelligent System Lab at Beijing University of Posts and Telecommunications. And a two-stage strategy is involved for this task. First, an adaptable Voting Model is carried out for blog distillation. Then, different models are designed to judge the...

FEUP at TREC 2010 Blog Track: Using h-index for blog ranking

This paper describes the participation of FEUP, from the University of Porto, in the TREC 2010 Blog Track. FEUP participated in the baseline blog distillation task with work focused on the use of link features available in the TREC Blogs08 collection. The approach presented in this paper uses the link information available in most individual posts to amplify each post’s score. Blog scores, and ...

University of Glasgow at TREC 2010: Experiments with Terrier in Blog and Web Tracks

In TREC 2010, we continue to build upon the Voting Model and experiment with our novel xQuAD framework within the auspices of the Terrier IR Platform. In particular, our focus is the development of novel applications for data-driven learning in the Blog and Web tracks, with experimentation spanning hundreds of features. In the Blog track, we propose novel feature sets for the ranking of blogs, ...

Publication date: 2010